Use this template to complete your project throughout the course. Your Final Project presentation will be based on the contents of this document. Replace the title/name above and text below with your own, but keep the headers. Feel free to change the theme and other display settings, but this is not required.

Overview

#Give a brief a description of your project and its goal(s), what data you are using to complete it, and what two faculty/staff in different fields you have spoken to about your project with a brief summary of what you learned from each person. Include a link to your final project GitHub repository. Social determinants of health, such as inadequate housing, lack of access to reliable transportation, and insufficient food access, have increasingly been recognized as being factors that adversely impact health outcomes (https://health.gov/healthypeople/priority-areas/social-determinants-health). In this project, with guidance from advisors John Holmes, PhD and Sameh Saleh, MD, I will attempt to determine if there is an observed association between low food access and type 2 diabetes prevalence in the United States for the most recent year for which data about both were available, 2015. It is possible that there is a positive association between decreased food access and type 2 diabetes prevalence, as individuals in low food access areas may have little choice but to rely on convenience foods and fast foods for their nutritional intake, which is likely to increase the risk of developing type 2 diabetes. I will use the AHRQ Social Determinants of Health dataset for this work.

Link to final project Github repository: https://github.com/mcgreevj/BMIN503_Final_Project

Introduction

#Describe the problem addressed, its significance, and some background to motivate the problem.

#Explain why your problem is interdisciplinary, what fields can contribute to its understanding, and incorporate background related to what you learned from meeting with faculty/staff. Type 2 diabetes is a major health problem in the United States, with significant morbidity and mortality for the at least 33 million people who suffer with this condition. Additionally, the economic costs of type 2 diabetes in the aggregate are substantial, totaling $327 billion in 2017 (The Cost of Diabetes | ADA). Tragically, type 2 diabetes is largely a preventable disease that has modifiable risk factors (weight, diet, exercise). Increasingly, there has been recognition that social determinants of health can serve as barriers that prevent individuals from being able to successfully modify risk factors. For example, individuals who lack access to insurance and/or reliable transportation may not be able to seek routine medical care for education, prevention, monitoring, and treatment purposes. As noted, diet is a major factor in the development of type 2 diabetes. And while ideal diets to prevent type 2 diabetes have been advised, not every American necessarily has access to food to be able to consume these recommended diets. This is the phenomenon of food insecurity. Food insecurity is formally defined by the US Department of Agriculture as: “the limited or uncertain availability of nutritionally adequate and safe foods, or limited or uncertain ability to acquire acceptable foods in socially acceptable ways.” (USDA ERS - Measurement). One factor that can lead to lack of access to healthy foods is the phenomenon of food deserts – areas that are low income and that have few or no supermarkets or other sources of healthy food options. As formally defined by the USDA, a food desert is: “A census tract that meets both low-income and low-access criteria including: 1. poverty rate is greater than or equal to 20 percent OR median family income does not exceed 80 percent statewide (rural/urban) or metro-area (urban) median family income; 2. at least 500 people or 33 percent of the population located more than 1 mile (urban) or 10 miles (rural) from the nearest supermarket or large grocery store.” In the absence of options for healthy food, individuals residing in these communities may need to resort to convenience and fast foods for some or many of their meals. In doing so, they increase the risk of developing type 2 diabetes.

This project is interdisciplinary in that it combines census tract, geographical data with health survey data. Notably the AHRQ SDOH database integrates data from multiple surveys and sources. The fields and topics that contribute to understanding of this issue are multiple: public health, clinical medicine, sociology, health disparities, economics, geography, history (in that the reasons many communities have limited access to food involves historical structural decisions, including forms of structural racism such as redlining). Moreover, the findings of this project have social justice, ethical, cultural, economic, and political implications. Based on collaboration with my mentors, I have learned the importance of familiarizing myself with data sources and data definitions, notably the Data Source Documentation guide from AHRQ (AHRQ Social Determinants of Health (SDOH) Database). This step is necessary from an exploratory standpoint to contemplate what kind of project I might do and also from a practical standpoint to understand how to properly and specifically answer my project question with the available data. I have also learned that some questions may be challenging to answer exactly as initially envisioned. For example, when considering this project at first, I had hoped to show changes over time in the association between limited food access and type 2 diabetes prevalence, perhaps including recent data from 2020 or 2021. However, after close inspection of this SDOH data set, I have realized that not all data is available for every year of interest. Thus I have needed to narrow the scope of my project a bit, given that data for limited food access and diabetes prevalence intersect in 2015, which is the most recent year that both types of data are available.

The identification of an independent association between low food access and type 2 diabetes prevalence would help support the case for policies and programs (both at the national and state levels) to improve food access in impacted census tracts. From a social justice standpoint, those with power in society (elected office holders, business and community leaders, members of the health care community including researchers) have a responsibility to seek ways to provide a known and effective preventive therapy (whether a vaccine or a grocery store) to those without such access currently, to prevent disease. There is also a national public health and cost dimension to this topic. People with diabetes suffer complications and morbidity related to diabetes, impairing their ability to support themselves, live independent lives, and contribute to a community’s economic activity and well-being. At the same time, the national cost of providing care for individuals with type 2 diabetes is borne by all Americans, in part because individuals living in low access food areas are more likely to have public (Medicaid, Medicare) insurance, if they have insurance at all, that is funded by all U.S. taxpayers. Indeed, Medicaid and Medicare combined provide health coverage to approximately 36% of Americans today and that percentage continues to rise over time. (Health Insurance Coverage of the Total Population | KFF). Data Note: Three Findings about Access to Care and Health Outcomes in Medicaid (kff.org). “Uncompensated” care for individuals with type 2 diabetes who lack any type of insurance is also a cost borne by individual health care organizations, governments, and ultimately, tax payers. Greater awareness of the relationships between SDOH and health outcomes, combined with a blunt accounting of the costs of inaction on this subject, will hopefully spur actions that can help mitigate SDOH and improve both individual and community health.

Methods

For this project, I used the AHRQ Social Determinants of Health database (available at https://www.ahrq.gov/sdoh/data-analytics/sdoh-data.html). I reviewed the data source documentation document (https://www.ahrq.gov/sites/default/files/wysiwyg/sdoh/SDOH-Data-Sources-Documentation-v1-Final.pdf) to learn about potential variables for this project and their availability. Ultimately, I planned to use two main variables available in the SDOH database: CHR_PCT_DM and FARA_TRACT_LILA_HALF_10.

CHR_PCT_DM is defined as the “percentage of adults with diagnosed diabetes (ages 20 and over).” Data for this variable is provided by the CDC Interactive Diabetes Atlas which in turn uses BRFSS data to estimate prevalence by U.S. county. Notably, starting in 2015, BRFSS included both landline and cellphone survey data to obtain diabetes prevalence data. Previously, this data had been collected exclusively via landline surveys.

FARA_PCT_LA_HALF_10_POP is defined as the “percentage of [healthy food] low access population measured at 1/2 mile for urban areas and 10 miles for rural areas.” For this variable, low access to healthy food is defined as being far from a supermarket, supercenter, or large grocery store, so more than 1/2 mile in urban areas and more than 10 miles in rural areas for this variable. #Does low access first require low-income designation for a tract, or is low access independent of low-income status for a tract? That is, a tract could be LA and not be LI?

This variable comes from the Food Access Research Atlas (FARA), created by the Economic Research Service (ERS) at the United States Department of Agriculture. A mapping tool is offered by the ERS and is available here: https://www.ers.usda.gov/data-products/food-access-research-atlas/go-to-the-atlas/.

In contrast to the diabetes prevalence data, FARA data is provided at the census tract level.

Of note, the following is the ERS methodology for calculating distance to the nearest grocery store, excerpted from the ERS definitions document available at https://www.ers.usda.gov/data-products/food-access-research-atlas/documentation/:

“First, the entire country is divided into ½-kilometer (about 1/3-mile)-square grids, and data on the population are aerially allocated to these grids (see Low-Income and Low-Foodstore-Access Census Tracts, 2015-19 in the link below). Distance to the nearest supermarket is measured for each grid cell by calculating the distance between the geographic center of the ½-kilometer square grid that contains estimates of the population (number of people and other subgroup characteristics) and the center of the grid with the nearest supermarket.”

While I had hoped to be able to include recent data for this project, the most recent year that data for both CHR_PCT_DM and FARA_PCT_LA_HALF_10_POP is available is 2015.

After having consulted with my advisors (Dr. Holmes and Dr. Saleh) and choosing the key variables above, I loaded the appropriate datasets into R.

For CHR_PCT_DIABETES, the 2015 data (by county) was available as an Excel file here: https://view.officeapps.live.com/op/view.aspx?src=https%3A%2F%2Fwww.ahrq.gov%2Fsites%2Fdefault%2Ffiles%2Fwysiwyg%2Fsdoh%2FSDOH_2015_COUNTY_1_0.xlsx&wdOrigin=BROWSELINK.

For FARA_PCT_LA_HALF_10_POP, the 2015 data (by tract) was available as an Excel file here: https://www.ahrq.gov/downloads/sdoh/sdoh_2015_tract_1_0.xlsx.

Once loaded into R, I manipulated each of these raw datasets so as to limit them to the variables of interest.

For the tract dataset, I mutated it to create a new variable, MEAN_PCT_LA_COUNTY, that reported the mean low access percentage by county, using group by and summarize functions.

I joined the dataframes containing CHR_PCT_DIABETES and FARA_PCT_LA_HALF_10_POP, respectively, by COUNTYFIPS, which was present in both dataframes.

In preparation for an exploratory data analysis using choropleths, I loaded census county files using practicum code as a guide.

I created a basic U.S. choropleth of low access percentage by county. Similarly, I created abasic U.S. choropleth of diabetes prevalence by county.

I will plan to create a logistic regression model to determine if there is any relationship between low food access and diabetes prevalence. [Is there any way to graphically demonstrate a relationship]

I will consider adjusting the model for other potentially relevant variables such as income, age, obesity if possible.

Results

Result of exploratory data analysis Result of logistic regression model Result after adjusting for other variables [preliminary code below]

Describe your results and include relevant tables, plots, and code/comments used to obtain them. End with a brief conclusion of your findings related to the question you set out to address. You can include references if you’d like, but this is not required.

Limitations There was not an evident way to capture purely type 2 diabetes data from available surveys. Interestingly, the CHR_PCT_DIABETES variable counts any diagnosis of diabetes, whether type 1, type 2, or other. In this project, the diabetes of interest is type 2 diabetes given that type 2 diabetes is associated with modifiable lifestyle factors. Still, given that 90-95% of all U.S. diabetes cases are type 2 (https://www.cdc.gov/diabetes/basics/type2.html#:~:text=Healthy%20eating%20is%20your%20recipe,adults%20are%20also%20developing%20it), this project is still capable of providing a reasonably strong estimate of any association between low access and type 2 diabetes prevalence.

Conclusion

[Describe the data used and general methodological approach. Subsequently, incorporate full R code necessary to retrieve and clean data, and perform analysis. Be sure to include a description of code so that others (including your future self) can understand what you are doing and why.]

library("readxl")
#Read in the 2015 low access data, reported by census tract
Lowaccess_raw<- read_excel("C:\\Users\\McGreevJ\\Desktop\\BMI 5030 Data Science\\BMIN503_Final_Project\\Adjusted datasets\\data.sdoh_2015_tract_1_0.xlsx")
dim(Lowaccess_raw)
## [1] 74133   405
#Read in the 2015 diabetes prevalence data, reported by county
Diabetes_raw<- read_excel("C:\\Users\\McGreevJ\\Desktop\\BMI 5030 Data Science\\BMIN503_Final_Project\\Adjusted datasets\\data.SDOH_2015_COUNTY_1_0.xlsx")
## Warning: Expecting logical in RJ1671 / R1671C478: got '46123'
## Warning: Expecting logical in RJ1763 / R1763C478: got '32510'
## Warning: Expecting logical in RK1763 / R1763C479: got '41025'
## Warning: Expecting logical in RL1763 / R1763C480: got '41037'
## Warning: Expecting logical in RJ2797 / R2797C478: got '49017'
## Warning: Expecting logical in RK2797 / R2797C479: got '49019'
## Warning: Expecting logical in RL2797 / R2797C480: got '49025'
## Warning: Expecting logical in RM2797 / R2797C481: got '49055'
## Warning: Expecting logical in RJ2842 / R2842C478: got '51760'
dim(Diabetes_raw)
## [1] 3234  979
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.0      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(cowplot)
library(dplyr)

#Limiting Lowaccess_raw to include variables of most interest:

Lowaccess<- dplyr::select(Lowaccess_raw, TRACTFIPS, COUNTYFIPS, STATEFIPS, STATE, COUNTY, REGION, ACS_TOT_POP_WT, ACS_MEDIAN_AGE, ACS_PCT_EMPLOYED, ACS_MEDIAN_HH_INC, ACS_PCT_HH_INC_10000, ACS_PCT_HH_INC_100000, ACS_PER_CAPITA_INC, ACS_PCT_POV_AIAN, ACS_MEDIAN_HOME_VALUE, ACS_PCT_MEDICAID_ANY, ACS_PCT_MEDICAID_ANY_BELOW64, ACS_PCT_MEDICARE_ONLY, ACS_PCT_PRIVATE_ANY, ACS_PCT_UNINSURED, FARA_TRACT_LILA_HALF_10, FARA_PCT_LA_HALF_10_POP, HRSA_MUA_CENSUS_TRACT, POS_DIST_ED_TRACT, CEN_AIAN_NH_IND)
str(Lowaccess) #74,133 x 25
## tibble [74,133 × 25] (S3: tbl_df/tbl/data.frame)
##  $ TRACTFIPS                   : chr [1:74133] "01001020100" "01001020200" "01001020300" "01001020400" ...
##  $ COUNTYFIPS                  : chr [1:74133] "01001" "01001" "01001" "01001" ...
##  $ STATEFIPS                   : chr [1:74133] "01" "01" "01" "01" ...
##  $ STATE                       : chr [1:74133] "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ COUNTY                      : chr [1:74133] "Autauga County" "Autauga County" "Autauga County" "Autauga County" ...
##  $ REGION                      : chr [1:74133] "South" "South" "South" "South" ...
##  $ ACS_TOT_POP_WT              : num [1:74133] 1948 2156 2968 4423 10763 ...
##  $ ACS_MEDIAN_AGE              : num [1:74133] 38.8 35.1 40.2 39.7 34.2 32.8 36.3 43.5 37.1 37.5 ...
##  $ ACS_PCT_EMPLOYED            : num [1:74133] 94.6 86.7 93.8 89.2 95.8 ...
##  $ ACS_MEDIAN_HH_INC           : num [1:74133] 61838 32303 44922 54329 51965 ...
##  $ ACS_PCT_HH_INC_10000        : num [1:74133] 2.3 10.29 8.68 1.05 4.15 ...
##  $ ACS_PCT_HH_INC_100000       : num [1:74133] 19.66 13.58 9.46 17.31 23.53 ...
##  $ ACS_PER_CAPITA_INC          : num [1:74133] 25713 18021 20689 24125 27526 ...
##  $ ACS_PCT_POV_AIAN            : num [1:74133] 0 NA 0 0 NA NA NA 100 NA NA ...
##  $ ACS_MEDIAN_HOME_VALUE       : num [1:74133] 149100 97900 110600 134700 174300 ...
##  $ ACS_PCT_MEDICAID_ANY        : num [1:74133] 16.12 24.49 12.13 4.49 5.65 ...
##  $ ACS_PCT_MEDICAID_ANY_BELOW64: num [1:74133] 17.52 25.38 12.7 4.94 6.35 ...
##  $ ACS_PCT_MEDICARE_ONLY       : num [1:74133] 4.83 3 3.41 3.45 2.2 4 3.48 5.19 3.98 3.99 ...
##  $ ACS_PCT_PRIVATE_ANY         : num [1:74133] 57.6 42.9 49.7 58.1 66.2 ...
##  $ ACS_PCT_UNINSURED           : num [1:74133] 11.55 14.08 11.18 15.77 5.89 ...
##  $ FARA_TRACT_LILA_HALF_10     : num [1:74133] 0 0 0 0 0 0 1 0 0 0 ...
##  $ FARA_PCT_LA_HALF_10_POP     : num [1:74133] 90.6 65 82 83.2 72.2 ...
##  $ HRSA_MUA_CENSUS_TRACT       : chr [1:74133] "0" "0" "0" "0" ...
##  $ POS_DIST_ED_TRACT           : num [1:74133] 2.23 1.37 0.83 0.55 1.59 ...
##  $ CEN_AIAN_NH_IND             : chr [1:74133] "0" "0" "0" "0" ...
colnames(Lowaccess)
##  [1] "TRACTFIPS"                    "COUNTYFIPS"                  
##  [3] "STATEFIPS"                    "STATE"                       
##  [5] "COUNTY"                       "REGION"                      
##  [7] "ACS_TOT_POP_WT"               "ACS_MEDIAN_AGE"              
##  [9] "ACS_PCT_EMPLOYED"             "ACS_MEDIAN_HH_INC"           
## [11] "ACS_PCT_HH_INC_10000"         "ACS_PCT_HH_INC_100000"       
## [13] "ACS_PER_CAPITA_INC"           "ACS_PCT_POV_AIAN"            
## [15] "ACS_MEDIAN_HOME_VALUE"        "ACS_PCT_MEDICAID_ANY"        
## [17] "ACS_PCT_MEDICAID_ANY_BELOW64" "ACS_PCT_MEDICARE_ONLY"       
## [19] "ACS_PCT_PRIVATE_ANY"          "ACS_PCT_UNINSURED"           
## [21] "FARA_TRACT_LILA_HALF_10"      "FARA_PCT_LA_HALF_10_POP"     
## [23] "HRSA_MUA_CENSUS_TRACT"        "POS_DIST_ED_TRACT"           
## [25] "CEN_AIAN_NH_IND"
#Lowaccess %>% dplyr::summarise(count(COUNTYFIPS))
#This gives a table of the number of tracts per COUNTYFIPS

#Trying a couple of things here to get an accurate county count for low access:
#1
#Lowaccess %>% dplyr::summarise(count(COUNTYFIPS)) %>% dplyr::summarise(sum(freq)) #74133 COUNTYFIPS codes
#did not run second time I ran it

#2
#Lowaccess %>% dplyr::summarise(count(COUNTY))  #1968 counties?
#did not run second time I ran it

#Limiting Diabetes_raw to include variables of most interest:
Diabetes<- dplyr::select(Diabetes_raw, COUNTYFIPS, STATEFIPS, STATE, COUNTY, REGION, ACS_TOT_POP_WT, ACS_MEDIAN_AGE, ACS_PCT_EMPLOYED, ACS_MEDIAN_HH_INC, ACS_PCT_HH_INC_10000, ACS_PCT_HH_INC_100000, ACS_PER_CAPITA_INC, ACS_PCT_POV_AIAN, ACS_MEDIAN_HOME_VALUE, ACS_PCT_MEDICAID_ANY, ACS_PCT_MEDICAID_ANY_BELOW64, ACS_PCT_MEDICARE_ONLY, ACS_PCT_PRIVATE_ANY, ACS_PCT_UNINSURED, CEN_AIAN_NH_IND, ACS_TOT_HH, ACS_AVG_HH_SIZE, ACS_PCT_CHILD_1FAM, CHR_PCT_ADULT_OBESITY, CHR_PCT_DIABETES)
str(Diabetes) #3234 x 25
## tibble [3,234 × 25] (S3: tbl_df/tbl/data.frame)
##  $ COUNTYFIPS                  : chr [1:3234] "01001" "01003" "01005" "01007" ...
##  $ STATEFIPS                   : chr [1:3234] "01" "01" "01" "01" ...
##  $ STATE                       : chr [1:3234] "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ COUNTY                      : chr [1:3234] "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
##  $ REGION                      : chr [1:3234] "South" "South" "South" "South" ...
##  $ ACS_TOT_POP_WT              : num [1:3234] 55221 195121 26932 22604 57710 ...
##  $ ACS_MEDIAN_AGE              : num [1:3234] 37.7 42.2 38.8 38.9 40.7 39.3 40.5 39.1 43 45.4 ...
##  $ ACS_PCT_EMPLOYED            : num [1:3234] 92.4 92.5 82.3 91.7 92.3 ...
##  $ ACS_MEDIAN_HH_INC           : num [1:3234] 51281 50254 32964 38678 45813 ...
##  $ ACS_PCT_HH_INC_10000        : num [1:3234] 6.14 6.36 13.5 7.74 7.77 ...
##  $ ACS_PCT_HH_INC_100000       : num [1:3234] 19.3 19.94 9.16 9.82 12.88 ...
##  $ ACS_PER_CAPITA_INC          : num [1:3234] 24974 27317 16824 18431 20532 ...
##  $ ACS_PCT_POV_AIAN            : num [1:3234] 52.2 9.15 0 4.65 6.6 ...
##  $ ACS_MEDIAN_HOME_VALUE       : num [1:3234] 141300 169300 92200 102700 119800 ...
##  $ ACS_PCT_MEDICAID_ANY        : num [1:3234] 12.3 12.3 25.2 18.4 15.4 ...
##  $ ACS_PCT_MEDICAID_ANY_BELOW64: num [1:3234] 13.2 14.4 27.6 20.3 16.8 ...
##  $ ACS_PCT_MEDICARE_ONLY       : num [1:3234] 3.86 6.42 5.31 6.89 5.82 3.54 7.28 4.78 7.51 6.43 ...
##  $ ACS_PCT_PRIVATE_ANY         : num [1:3234] 59.2 59.1 43.6 56.1 58.4 ...
##  $ ACS_PCT_UNINSURED           : num [1:3234] 10.12 12.96 15.51 9.66 11.64 ...
##  $ CEN_AIAN_NH_IND             : chr [1:3234] "0" "0" "0" "0" ...
##  $ ACS_TOT_HH                  : num [1:3234] 20396 74104 9222 7027 20816 ...
##  $ ACS_AVG_HH_SIZE             : num [1:3234] 2.68 2.6 2.61 2.95 2.74 2.73 2.49 2.52 2.44 2.28 ...
##  $ ACS_PCT_CHILD_1FAM          : num [1:3234] 22.3 22.5 53.4 26 25.2 ...
##  $ CHR_PCT_ADULT_OBESITY       : num [1:3234] 37.5 31 44.3 37.8 34.4 39.4 40.2 37.1 40.3 36.3 ...
##  $ CHR_PCT_DIABETES            : num [1:3234] 14.2 11.3 18 14.9 14.3 20 17.9 16.2 17 13.5 ...
#Diabetes %>% dplyr::summarise(count(COUNTYFIPS)) %>% dplyr::summarise(sum(freq)) #3234 COUNTYFIPS codes (or 3234 unique counties)
#did not run the second time I ran it

#Still need to understand why the apparent difference in county counts above between Low access and Diabetes datasets - #suggestions welcome
#Transform low access data that is provided by *census tract* into a dataframe "Lowaccess_county" where FARA_PCT_LA_HALF_10_POP is summarised as a mean value for each county, omitting NA values.

#Possibly instead of mean, I could use a weighted mean - will still need to work through how to do that properly

Lowaccess_county<- Lowaccess %>% filter(!is.na(FARA_PCT_LA_HALF_10_POP)) %>% dplyr::group_by(COUNTYFIPS) %>% dplyr::summarise(MEAN_PCT_LA_COUNTY = mean(FARA_PCT_LA_HALF_10_POP))
head(Lowaccess_county)
## # A tibble: 6 × 2
##   COUNTYFIPS MEAN_PCT_LA_COUNTY
##   <chr>                   <dbl>
## 1 01001                   63.2 
## 2 01003                   42.1 
## 3 01005                   31.1 
## 4 01007                    1.32
## 5 01009                   11.1 
## 6 01011                   71.4
#Lowaccess_county %>% dplyr::summarise(count(COUNTYFIPS)) %>% dplyr::summarise(sum(freq)) #3140 COUNTYFIPS codes
#did not run the second time I ran this

str(Lowaccess_county) #3140 variables
## tibble [3,140 × 2] (S3: tbl_df/tbl/data.frame)
##  $ COUNTYFIPS        : chr [1:3140] "01001" "01003" "01005" "01007" ...
##  $ MEAN_PCT_LA_COUNTY: num [1:3140] 63.25 42.07 31.1 1.32 11.07 ...
str(Lowaccess_county$COUNTYFIPS) #3140 values
##  chr [1:3140] "01001" "01003" "01005" "01007" "01009" "01011" "01013" ...
str(Diabetes$COUNTYFIPS) #3234 values 
##  chr [1:3234] "01001" "01003" "01005" "01007" "01009" "01011" "01013" ...
#Related to question on line 130, why the differences in countyfips counts between Lowaccess_county and Diabetes dataframes?

#Omitting any na values
Lowaccess_county<- na.omit(Lowaccess_county)
Diabetes<- na.omit(Diabetes)

str(Lowaccess_county$COUNTYFIPS) #3140 values, chr
##  chr [1:3140] "01001" "01003" "01005" "01007" "01009" "01011" "01013" ...
str(Diabetes$COUNTYFIPS) #3234 values, chr; but another time I ran it there are only 2927 values, #why the difference?
##  chr [1:2927] "01001" "01003" "01005" "01007" "01009" "01011" "01013" ...
#Still a difference in countyfips counts, why the difference?
#Joining Diabetes and Lowaccess_county dataframes 

#Notes for thinking about this:
#How many counties are in DM
#How many counties are in Low Access?
#Which counties are different?
#Consider an inner join
#county.x, county.y means that county columns for two different dataframes were joined and both preserved, now distinguished with ".x" or ".y"
#remember that in a join, columns are preserved but row contents may not be


#Which of the following, if either, should I use?

#Innerjoin?
test_innerjoin<- inner_join(Diabetes, Lowaccess_county, by = "COUNTYFIPS")
head(test_innerjoin)
## # A tibble: 6 × 26
##   COUNTYFIPS STATE…¹ STATE COUNTY REGION ACS_T…² ACS_M…³ ACS_P…⁴ ACS_M…⁵ ACS_P…⁶
##   <chr>      <chr>   <chr> <chr>  <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 01001      01      Alab… Autau… South    55221    37.7    92.4   51281    6.14
## 2 01003      01      Alab… Baldw… South   195121    42.2    92.5   50254    6.36
## 3 01005      01      Alab… Barbo… South    26932    38.8    82.4   32964   13.5 
## 4 01007      01      Alab… Bibb … South    22604    38.9    91.7   38678    7.74
## 5 01009      01      Alab… Bloun… South    57710    40.7    92.3   45813    7.77
## 6 01011      01      Alab… Bullo… South    10678    39.3    82.0   31938   18.9 
## # … with 16 more variables: ACS_PCT_HH_INC_100000 <dbl>,
## #   ACS_PER_CAPITA_INC <dbl>, ACS_PCT_POV_AIAN <dbl>,
## #   ACS_MEDIAN_HOME_VALUE <dbl>, ACS_PCT_MEDICAID_ANY <dbl>,
## #   ACS_PCT_MEDICAID_ANY_BELOW64 <dbl>, ACS_PCT_MEDICARE_ONLY <dbl>,
## #   ACS_PCT_PRIVATE_ANY <dbl>, ACS_PCT_UNINSURED <dbl>, CEN_AIAN_NH_IND <chr>,
## #   ACS_TOT_HH <dbl>, ACS_AVG_HH_SIZE <dbl>, ACS_PCT_CHILD_1FAM <dbl>,
## #   CHR_PCT_ADULT_OBESITY <dbl>, CHR_PCT_DIABETES <dbl>, …
str(test_innerjoin$COUNTYFIPS) #how many values? #3140
##  chr [1:2925] "01001" "01003" "01005" "01007" "01009" "01011" "01013" ...
#Leftjoin?
test_leftjoin<- left_join(Diabetes, Lowaccess_county, by = "COUNTYFIPS")
head(test_leftjoin)
## # A tibble: 6 × 26
##   COUNTYFIPS STATE…¹ STATE COUNTY REGION ACS_T…² ACS_M…³ ACS_P…⁴ ACS_M…⁵ ACS_P…⁶
##   <chr>      <chr>   <chr> <chr>  <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 01001      01      Alab… Autau… South    55221    37.7    92.4   51281    6.14
## 2 01003      01      Alab… Baldw… South   195121    42.2    92.5   50254    6.36
## 3 01005      01      Alab… Barbo… South    26932    38.8    82.4   32964   13.5 
## 4 01007      01      Alab… Bibb … South    22604    38.9    91.7   38678    7.74
## 5 01009      01      Alab… Bloun… South    57710    40.7    92.3   45813    7.77
## 6 01011      01      Alab… Bullo… South    10678    39.3    82.0   31938   18.9 
## # … with 16 more variables: ACS_PCT_HH_INC_100000 <dbl>,
## #   ACS_PER_CAPITA_INC <dbl>, ACS_PCT_POV_AIAN <dbl>,
## #   ACS_MEDIAN_HOME_VALUE <dbl>, ACS_PCT_MEDICAID_ANY <dbl>,
## #   ACS_PCT_MEDICAID_ANY_BELOW64 <dbl>, ACS_PCT_MEDICARE_ONLY <dbl>,
## #   ACS_PCT_PRIVATE_ANY <dbl>, ACS_PCT_UNINSURED <dbl>, CEN_AIAN_NH_IND <chr>,
## #   ACS_TOT_HH <dbl>, ACS_AVG_HH_SIZE <dbl>, ACS_PCT_CHILD_1FAM <dbl>,
## #   CHR_PCT_ADULT_OBESITY <dbl>, CHR_PCT_DIABETES <dbl>, …
str(test_leftjoin$COUNTYFIPS) #how many values? #3234
##  chr [1:2927] "01001" "01003" "01005" "01007" "01009" "01011" "01013" ...
#As noted above, will likely need to start at the beginning to understand why county counts, countyfips counts vary between these two dataframes

#Checking for any evident differences in output between two approaches, at least in headers of each
#Mean PCT LA by county
head(test_innerjoin$MEAN_PCT_LA_COUNTY)
## [1] 63.24917 42.07161 31.10111  1.32000 11.06556 71.36667
head(test_leftjoin$MEAN_PCT_LA_COUNTY)
## [1] 63.24917 42.07161 31.10111  1.32000 11.06556 71.36667
#Do not appreciate any differences

#PCT Diabetes
head(test_innerjoin$CHR_PCT_DIABETES)
## [1] 14.2 11.3 18.0 14.9 14.3 20.0
head(test_leftjoin$CHR_PCT_DIABETES)
## [1] 14.2 11.3 18.0 14.9 14.3 20.0
#Do not appreciate any differences
#Loading libraries, county geography data in prep for exploratory data analysis, including choropleths:

library(rgdal)
## Warning: package 'rgdal' was built under R version 4.2.2
## Loading required package: sp
## Please note that rgdal will be retired during 2023,
## plan transition to sf/stars/terra functions using GDAL and PROJ
## at your earliest convenience.
## See https://r-spatial.org/r/2022/04/12/evolution.html and https://github.com/r-spatial/evolution
## rgdal: version: 1.6-2, (SVN revision 1183)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.5.2, released 2022/09/02
## Path to GDAL shared files: C:/Users/McGreevJ/AppData/Local/Programs/R/R-4.2.1/library/rgdal/gdal
## GDAL binary built with GEOS: TRUE 
## Loaded PROJ runtime: Rel. 8.2.1, January 1st, 2022, [PJ_VERSION: 821]
## Path to PROJ shared files: C:/Users/McGreevJ/AppData/Local/Programs/R/R-4.2.1/library/rgdal/proj
## PROJ CDN enabled: FALSE
## Linking to sp version:1.5-1
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading sp or rgdal.
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.2.2
library(ggplot2)
library(RColorBrewer)
library(plyr)
## Warning: package 'plyr' was built under R version 4.2.2
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:purrr':
## 
##     compact
library(rgdal)
library(sf)
## Warning: package 'sf' was built under R version 4.2.2
## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
library(tidycensus)
## Warning: package 'tidycensus' was built under R version 4.2.2
library(ggsn)
## Warning: package 'ggsn' was built under R version 4.2.2
## Loading required package: grid
library(leaflet)
library(ggplot2)
library(RColorBrewer)

counties<- readRDS(gzcon(url("https://raw.githubusercontent.com/HimesGroup/BMIN503/master/DataFiles/uscounties_2010.rds")))

st_crs(counties) <- st_crs(counties)
## old-style crs object detected; please recreate object with a recent sf::st_crs()
## old-style crs object detected; please recreate object with a recent sf::st_crs()
#using base plot function to check that counties contains polygon data: 
plot(counties)

#using ggplot to check that counties contains polygon data:
ggplot() +
    geom_sf(data = counties)

#Joining combined data set (test_leftjoin for now, noting I still need to determine if that was the correct join) and counties, but first, need to create a common variable by which to join. Will convert State to a 2 digit number in Counties, then mutate State and County into a 5 digit number, a mutated variable in Counties called "COUNTYFIPS" to correspond to "COUNTYFIPS" in the combined data set, test_leftjoin.

counties$STATE<- str_pad(counties$STATE, width=2, side="left", pad="0")
head(counties)
## Simple feature collection with 6 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -88.13999 ymin: 31.19518 xmax: -85.12342 ymax: 34.8922
## Geodetic CRS:  WGS 84
##           GEO_ID STATE COUNTY     NAME   LSAD CENSUSAREA
## 1 0500000US01001    01    001  Autauga County    594.436
## 2 0500000US01009    01    009   Blount County    644.776
## 3 0500000US01017    01    017 Chambers County    596.531
## 4 0500000US01021    01    021  Chilton County    692.854
## 5 0500000US01033    01    033  Colbert County    592.619
## 6 0500000US01045    01    045     Dale County    561.150
##                         geometry
## 1 MULTIPOLYGON (((-86.49677 3...
## 2 MULTIPOLYGON (((-86.5778 33...
## 3 MULTIPOLYGON (((-85.18413 3...
## 4 MULTIPOLYGON (((-86.51734 3...
## 5 MULTIPOLYGON (((-88.13999 3...
## 6 MULTIPOLYGON (((-85.41644 3...
counties<- counties %>% dplyr::mutate(COUNTYFIPS = paste(STATE, COUNTY, sep = ""))
head(counties)
## Simple feature collection with 6 features and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -88.13999 ymin: 31.19518 xmax: -85.12342 ymax: 34.8922
## Geodetic CRS:  WGS 84
##           GEO_ID STATE COUNTY     NAME   LSAD CENSUSAREA
## 1 0500000US01001    01    001  Autauga County    594.436
## 2 0500000US01009    01    009   Blount County    644.776
## 3 0500000US01017    01    017 Chambers County    596.531
## 4 0500000US01021    01    021  Chilton County    692.854
## 5 0500000US01033    01    033  Colbert County    592.619
## 6 0500000US01045    01    045     Dale County    561.150
##                         geometry COUNTYFIPS
## 1 MULTIPOLYGON (((-86.49677 3...      01001
## 2 MULTIPOLYGON (((-86.5778 33...      01009
## 3 MULTIPOLYGON (((-85.18413 3...      01017
## 4 MULTIPOLYGON (((-86.51734 3...      01021
## 5 MULTIPOLYGON (((-88.13999 3...      01033
## 6 MULTIPOLYGON (((-85.41644 3...      01045
str(counties$COUNTYFIPS) #3109 values
##  chr [1:3109] "01001" "01009" "01017" "01021" "01033" "01045" "01051" ...
str(test_leftjoin$COUNTYFIPS) #2927 values, #why the difference?
##  chr [1:2927] "01001" "01003" "01005" "01007" "01009" "01011" "01013" ...
test_counties_left<- left_join(counties, test_leftjoin, by = 'COUNTYFIPS')
str(test_counties_left) #3109 obs of 33 variables
## Classes 'sf' and 'data.frame':   3109 obs. of  33 variables:
##  $ GEO_ID                      : Factor w/ 3221 levels "0500000US01001",..: 1 5 9 11 17 23 26 33 40 42 ...
##  $ STATE.x                     : chr  "01" "01" "01" "01" ...
##  $ COUNTY.x                    : Factor w/ 325 levels "001","003","005",..: 1 6 13 16 23 30 34 43 53 55 ...
##  $ NAME                        : Factor w/ 1909 levels "Abbeville","Acadia",..: 89 174 309 341 387 460 567 730 986 1009 ...
##  $ LSAD                        : Factor w/ 8 levels "Borough","CA",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ CENSUSAREA                  : num  594 645 597 693 593 ...
##  $ COUNTYFIPS                  : chr  "01001" "01009" "01017" "01021" ...
##  $ STATEFIPS                   : chr  "01" "01" "01" "01" ...
##  $ STATE.y                     : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ COUNTY.y                    : chr  "Autauga County" "Blount County" "Chambers County" "Chilton County" ...
##  $ REGION                      : chr  "South" "South" "South" "South" ...
##  $ ACS_TOT_POP_WT              : num  55221 57710 34079 43819 54444 ...
##  $ ACS_MEDIAN_AGE              : num  37.7 40.7 43 38.8 42.2 36.8 37.9 40.2 41.7 39 ...
##  $ ACS_PCT_EMPLOYED            : num  92.4 92.3 91.1 90.8 91 ...
##  $ ACS_MEDIAN_HH_INC           : num  51281 45813 34177 41627 40576 ...
##  $ ACS_PCT_HH_INC_10000        : num  6.14 7.77 10.63 9.22 10.35 ...
##  $ ACS_PCT_HH_INC_100000       : num  19.3 12.88 8.52 12.55 11.68 ...
##  $ ACS_PER_CAPITA_INC          : num  24974 20532 21071 21399 22546 ...
##  $ ACS_PCT_POV_AIAN            : num  52.2 6.6 36.84 6.33 18.24 ...
##  $ ACS_MEDIAN_HOME_VALUE       : num  141300 119800 80800 100100 99400 ...
##  $ ACS_PCT_MEDICAID_ANY        : num  12.3 15.4 19.9 16 15.1 ...
##  $ ACS_PCT_MEDICAID_ANY_BELOW64: num  13.2 16.8 21.3 17.6 16.7 ...
##  $ ACS_PCT_MEDICARE_ONLY       : num  3.86 5.82 7.51 6.28 5.76 3.9 3.54 5.69 5.23 4.2 ...
##  $ ACS_PCT_PRIVATE_ANY         : num  59.2 58.4 50.6 53.8 58.6 ...
##  $ ACS_PCT_UNINSURED           : num  10.1 11.6 13.9 16 11.1 ...
##  $ CEN_AIAN_NH_IND             : chr  "0" "0" "0" "0" ...
##  $ ACS_TOT_HH                  : num  20396 20816 13787 16237 22204 ...
##  $ ACS_AVG_HH_SIZE             : num  2.68 2.74 2.44 2.68 2.43 2.56 2.65 2.53 2.46 2.64 ...
##  $ ACS_PCT_CHILD_1FAM          : num  22.3 25.2 50.4 26.7 31.1 ...
##  $ CHR_PCT_ADULT_OBESITY       : num  37.5 34.4 40.3 35.3 32.5 37.3 36.1 42 35 33.6 ...
##  $ CHR_PCT_DIABETES            : num  14.2 14.3 17 14.7 17.6 15.6 14.6 16.1 14.9 12.4 ...
##  $ MEAN_PCT_LA_COUNTY          : num  63.2 11.1 36 11.2 54.3 ...
##  $ geometry                    :sfc_MULTIPOLYGON of length 3109; first list element: List of 1
##   ..$ :List of 1
##   .. ..$ : num [1:10, 1:2] -86.5 -86.7 -86.8 -86.9 -86.9 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "names")= chr [1:32] "GEO_ID" "STATE.x" "COUNTY.x" "NAME" ...
test_counties_inner<- inner_join
str(test_counties_inner) #2893 obs of 33 variables
## function (x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ..., keep = FALSE)
#Exploratory data analysis - sorting data, top/bottom rates and corresponding counties

#Starting with low access...

colnames(test_counties_left)
##  [1] "GEO_ID"                       "STATE.x"                     
##  [3] "COUNTY.x"                     "NAME"                        
##  [5] "LSAD"                         "CENSUSAREA"                  
##  [7] "COUNTYFIPS"                   "STATEFIPS"                   
##  [9] "STATE.y"                      "COUNTY.y"                    
## [11] "REGION"                       "ACS_TOT_POP_WT"              
## [13] "ACS_MEDIAN_AGE"               "ACS_PCT_EMPLOYED"            
## [15] "ACS_MEDIAN_HH_INC"            "ACS_PCT_HH_INC_10000"        
## [17] "ACS_PCT_HH_INC_100000"        "ACS_PER_CAPITA_INC"          
## [19] "ACS_PCT_POV_AIAN"             "ACS_MEDIAN_HOME_VALUE"       
## [21] "ACS_PCT_MEDICAID_ANY"         "ACS_PCT_MEDICAID_ANY_BELOW64"
## [23] "ACS_PCT_MEDICARE_ONLY"        "ACS_PCT_PRIVATE_ANY"         
## [25] "ACS_PCT_UNINSURED"            "CEN_AIAN_NH_IND"             
## [27] "ACS_TOT_HH"                   "ACS_AVG_HH_SIZE"             
## [29] "ACS_PCT_CHILD_1FAM"           "CHR_PCT_ADULT_OBESITY"       
## [31] "CHR_PCT_DIABETES"             "MEAN_PCT_LA_COUNTY"          
## [33] "geometry"
#Inspecting a trimmed-down dataframe
selection<- test_counties_left %>% dplyr::select(COUNTY.y, STATE.y, MEAN_PCT_LA_COUNTY)
View(selection)
#Mean low access percentage
mean_LA<- test_counties_left %>% dplyr::summarise(meanLA = mean(MEAN_PCT_LA_COUNTY, na.rm = TRUE))
mean_LA #mean low access was 38.21%
## Simple feature collection with 1 feature and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.7332 ymin: 24.5447 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS:  WGS 84
##     meanLA                       geometry
## 1 38.21385 MULTIPOLYGON (((-96.83003 2...
#Median low access percentage
median_LA<- test_counties_left %>% dplyr::summarise(medianLA = median(MEAN_PCT_LA_COUNTY, na.rm = TRUE))
median_LA #median low access was 37.18%
## Simple feature collection with 1 feature and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.7332 ymin: 24.5447 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS:  WGS 84
##   medianLA                       geometry
## 1  37.1827 MULTIPOLYGON (((-96.83003 2...
#How many counties have 100% of population with low access?

y<-test_counties_left$MEAN_PCT_LA_COUNTY  

counttoplowaccess<-length(which(y == 100))
counttoplowaccess #37 counties have 100% of the population with low access to food
## [1] 37
#Lowest access counties

Lowest_access_counties<- test_counties_left %>% dplyr::select(COUNTYFIPS, REGION, STATE.y, COUNTY.y, MEAN_PCT_LA_COUNTY) %>% slice_max(MEAN_PCT_LA_COUNTY) 
Lowest_access_counties
## Simple feature collection with 37 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -120.4952 ymin: 29.77719 xmax: -79.31228 ymax: 48.99902
## Geodetic CRS:  WGS 84
## First 10 features:
##    COUNTYFIPS  REGION      STATE.y         COUNTY.y MEAN_PCT_LA_COUNTY
## 1       20075 Midwest       Kansas  Hamilton County                100
## 2       20203 Midwest       Kansas   Wichita County                100
## 3       30059    West      Montana   Meagher County                100
## 4       30107    West      Montana Wheatland County                100
## 5       40043   South     Oklahoma     Dewey County                100
## 6       38007 Midwest North Dakota  Billings County                100
## 7       41069    West       Oregon   Wheeler County                100
## 8       48359   South        Texas    Oldham County                100
## 9       08061    West     Colorado     Kiowa County                100
## 10      16025    West        Idaho     Camas County                100
##                          geometry
## 1  MULTIPOLYGON (((-101.5675 3...
## 2  MULTIPOLYGON (((-101.5671 3...
## 3  MULTIPOLYGON (((-110.2819 4...
## 4  MULTIPOLYGON (((-109.6543 4...
## 5  MULTIPOLYGON (((-99.38102 3...
## 6  MULTIPOLYGON (((-103.6092 4...
## 7  MULTIPOLYGON (((-120.3714 4...
## 8  MULTIPOLYGON (((-103.0424 3...
## 9  MULTIPOLYGON (((-102.0452 3...
## 10 MULTIPOLYGON (((-114.3749 4...
View(Lowest_access_counties)
#In what counties do more than 75% of the population have low food access?

yseventyfive<- test_counties_left$MEAN_PCT_LA_COUNTY
countseventyfive<- length(which(yseventyfive > 75))
countseventyfive #In 126 counties greater than 75% of the population has low food access
## [1] 126
Topquartile<- test_counties_left %>% dplyr::select(COUNTYFIPS, REGION, STATE.y, COUNTY.y, MEAN_PCT_LA_COUNTY) %>% filter(MEAN_PCT_LA_COUNTY > 75)
Topquartile
## Simple feature collection with 126 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -123.7279 ymin: 26.7174 xmax: -71.2248 ymax: 49.00069
## Geodetic CRS:  WGS 84
## First 10 features:
##    COUNTYFIPS  REGION  STATE.y          COUNTY.y MEAN_PCT_LA_COUNTY
## 1       08014    West Colorado Broomfield County           81.58400
## 2       13059   South  Georgia     Clarke County           75.34867
## 3       08111    West Colorado   San Juan County           99.24000
## 4       20049 Midwest   Kansas        Elk County           99.36000
## 5       20075 Midwest   Kansas   Hamilton County          100.00000
## 6       20203 Midwest   Kansas    Wichita County          100.00000
## 7       31163 Midwest Nebraska    Sherman County           83.43000
## 8       31171 Midwest Nebraska     Thomas County           99.94000
## 9       32009    West   Nevada  Esmeralda County           97.19000
## 10      32021    West   Nevada    Mineral County           86.83500
##                          geometry
## 1  MULTIPOLYGON (((-105.1473 3...
## 2  MULTIPOLYGON (((-83.36003 3...
## 3  MULTIPOLYGON (((-107.7383 3...
## 4  MULTIPOLYGON (((-96.52569 3...
## 5  MULTIPOLYGON (((-101.5675 3...
## 6  MULTIPOLYGON (((-101.5671 3...
## 7  MULTIPOLYGON (((-98.74853 4...
## 8  MULTIPOLYGON (((-100.8461 4...
## 9  MULTIPOLYGON (((-117.691 38...
## 10 MULTIPOLYGON (((-117.691 38...
View(Topquartile)
#What regions have the lowest food access?
Topregions<- Topquartile %>% dplyr::summarise(count(REGION)) 
Topregions
## Simple feature collection with 4 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -123.7279 ymin: 26.7174 xmax: -71.2248 ymax: 49.00069
## Geodetic CRS:  WGS 84
##           x freq                       geometry
## 1   Midwest   48 MULTIPOLYGON (((-96.81783 3...
## 2 Northeast    1 MULTIPOLYGON (((-96.81783 3...
## 3     South   38 MULTIPOLYGON (((-96.81783 3...
## 4      West   39 MULTIPOLYGON (((-96.81783 3...
Topregions_summary<- dplyr::arrange(Topregions, desc(freq))
Topregions_summary
## Simple feature collection with 4 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -123.7279 ymin: 26.7174 xmax: -71.2248 ymax: 49.00069
## Geodetic CRS:  WGS 84
##           x freq                       geometry
## 1   Midwest   48 MULTIPOLYGON (((-96.81783 3...
## 2      West   39 MULTIPOLYGON (((-96.81783 3...
## 3     South   38 MULTIPOLYGON (((-96.81783 3...
## 4 Northeast    1 MULTIPOLYGON (((-96.81783 3...
#Ranked by number of counties with >75% of population with low food access:Midwest with 48 counties; West with 39 counties, South with 38 counties, and Northeast with 1 county, for 126 counties total 
#How many counties have 0% of population with low access (that is, they are high food access)?

y1<-test_counties_left$MEAN_PCT_LA_COUNTY  
countleastlowaccess<-length(which(y1 == 0))
countleastlowaccess #29 counties have 0% of the population with low access to food
## [1] 29
#Highest access counties (0% of population is low access)
Highest_access_counties<- test_counties_left %>% dplyr::select(COUNTYFIPS, REGION, STATE.y, COUNTY.y, MEAN_PCT_LA_COUNTY) %>% slice_min(MEAN_PCT_LA_COUNTY) 
Highest_access_counties
## Simple feature collection with 29 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -105.6903 ymin: 31.43369 xmax: -76.24528 ymax: 45.21074
## Geodetic CRS:  WGS 84
## First 10 features:
##    COUNTYFIPS  REGION        STATE.y          COUNTY.y MEAN_PCT_LA_COUNTY
## 1       08047    West       Colorado     Gilpin County                  0
## 2       17035 Midwest       Illinois Cumberland County                  0
## 3       37103   South North Carolina      Jones County                  0
## 4       47057   South      Tennessee   Grainger County                  0
## 5       48379   South          Texas      Rains County                  0
## 6       48425   South          Texas  Somervell County                  0
## 7       13119   South        Georgia   Franklin County                  0
## 8       13147   South        Georgia       Hart County                  0
## 9       28031   South    Mississippi  Covington County                  0
## 10      26019 Midwest       Michigan     Benzie County                  0
##                          geometry
## 1  MULTIPOLYGON (((-105.3978 3...
## 2  MULTIPOLYGON (((-88.01212 3...
## 3  MULTIPOLYGON (((-77.47372 3...
## 4  MULTIPOLYGON (((-83.66746 3...
## 5  MULTIPOLYGON (((-95.66539 3...
## 6  MULTIPOLYGON (((-97.86486 3...
## 7  MULTIPOLYGON (((-83.35527 3...
## 8  MULTIPOLYGON (((-82.99222 3...
## 9  MULTIPOLYGON (((-89.39918 3...
## 10 MULTIPOLYGON (((-86.2335 44...
View(Highest_access_counties)
#Need to check this...some likely rural South counties where I am surprised that the county would be 0% low access
#How many counties are missing values for low access?
y2<-test_counties_left$MEAN_PCT_LA_COUNTY  
countmissinglowaccess<-length(which(is.na(y2)))
countmissinglowaccess #216 counties are missing values for low access
## [1] 216
#Missing access counties (by COUNTYFIPS)
Missing_access_counties<- test_counties_left %>% dplyr::select(COUNTYFIPS, REGION, STATE.y, COUNTY.y, MEAN_PCT_LA_COUNTY) %>% dplyr::filter(is.na(MEAN_PCT_LA_COUNTY))
View(Missing_access_counties)
#Why are county names missing? #How can I get county names (back)?
#Continuing exploratory data analysis with diabetes...

#Inspecting a trimmed-down dataframe
selection_dm<- test_counties_left %>% dplyr::select(COUNTY.y, STATE.y, CHR_PCT_DIABETES)
View(selection_dm)
#Mean percentage of population with diabetes
mean_DM<- test_counties_left %>% dplyr::summarise(meanDM = mean(CHR_PCT_DIABETES, na.rm = TRUE))
mean_DM #Mean percentage of population with diabetes was 11.57%
## Simple feature collection with 1 feature and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.7332 ymin: 24.5447 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS:  WGS 84
##     meanDM                       geometry
## 1 11.57276 MULTIPOLYGON (((-96.83003 2...
#Median percentage of population with diabetes
median_DM<- test_counties_left %>% dplyr::summarise(medianDM = median(CHR_PCT_DIABETES, na.rm = TRUE))
median_DM #Median perentage of population with diabetes was 11.4%
## Simple feature collection with 1 feature and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.7332 ymin: 24.5447 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS:  WGS 84
##   medianDM                       geometry
## 1     11.4 MULTIPOLYGON (((-96.83003 2...
#What were the top 100 U.S. counties in terms of greatest percentage of population with diabetes in 2015?

Diabetes_high<- test_counties_left %>% dplyr::select(COUNTYFIPS, REGION, STATE.y, COUNTY.y, CHR_PCT_DIABETES) %>% slice_max(CHR_PCT_DIABETES, n = 100)
View(Diabetes_high)
#What regions had the greatest number of counties in the top 100 in terms of percentage of population with diabetes in 2015?

Topregions_DM<- Diabetes_high %>% dplyr::summarize(count(REGION) %>% dplyr::arrange(desc(freq)))
View(Topregions_DM)
#South and Midwest. #95 of the top 100 counties in terms of highest percentages of diabetes are in the South. 5 are in the Midwest.
#What are the 100 counties with the lowest percentages of their populations with diabetes?

Diabetes_low<- test_counties_left %>% dplyr::select(COUNTYFIPS, REGION, STATE.y, COUNTY.y, CHR_PCT_DIABETES) %>% slice_min(CHR_PCT_DIABETES, n = 100)
View(Diabetes_low)  
#What regions are represented by the bottom 100 counties, those counties that have the lowest percentage of population with diabetes? 

Bottomregions_DM<- Diabetes_low %>% dplyr::summarize(count(REGION) %>% dplyr::arrange(desc(freq)))
View(Bottomregions_DM)
#West (64 counties), Midwest (25 counties), Northeast (9 counties), South (8 counties)
#How many counties are missing values for diabetes prevalence?

y_dm<-test_counties_left$CHR_PCT_DIABETES  
countmissingdm<-length(which(is.na(y_dm)))
countmissingdm #216 counties are missing values for diabetes prevalence
## [1] 216
#Missing access counties (by COUNTYFIPS)

Missing_dm_counties<- test_counties_left %>% dplyr::select(COUNTYFIPS, REGION, STATE.y, COUNTY.y, CHR_PCT_DIABETES) %>% dplyr::filter(is.na(CHR_PCT_DIABETES))
View(Missing_dm_counties)  
#Why are county names not showing? Need to figure this out.
#Exploratory data analysis - potential choropleths and/or a leaflet
library(RColorBrewer)
myPalette <- colorRampPalette(brewer.pal(9, "BuPu"))


my_theme <- function() {
  theme_minimal() +                                  
  theme(axis.line = element_blank(),                 
        axis.text = element_blank(),                 
        axis.title = element_blank(),
        panel.grid = element_line(color = "white"),  
        legend.key.size = unit(0.5, "cm"),          
        legend.text = element_text(size = 10),       
        legend.title = element_text(size = 12),
        plot.title = element_text(size = 16))      
}  
myPalette <- colorRampPalette(brewer.pal(9, "YlOrRd")) 


#colors
library(leaflet)
# Select a color palette with which to run the palette function
pal_fun <- colorNumeric("BuPu", NULL)       # Blue-Purple from RColorBrewer
pal_fun2 <- colorNumeric("YlOrRd", NULL)    # Yellow-Orange-Red from RColorBrewer
pal_fun3 <- colorNumeric("viridis", NULL)   # viridis from viridis
pal_fun4 <- colorNumeric("inferno", NULL)   # inferno from viridis
pal_fun5 <- colorNumeric("inferno", NULL, reverse=TRUE)  # reverses the color ramp

#The following are some options - welcome any suggestions for making them prettier, less clunky, noting scalebar overlaying the choropleth in at least once case!

#Low access plot, using left join, pink/white
ggplot() +
  geom_sf(data = test_counties_left, aes(fill = MEAN_PCT_LA_COUNTY), lwd = 0) +
  my_theme() +
  ggtitle("Percent of population with low access to food, by U.S. county, 2015, left, pink scheme") +
  scale_fill_continuous(low = "antiquewhite2", high = "red", guide = "colorbar") +
  north(x.min = -75.28031, y.min = 39.86747,    # add north bar with north()
        x.max = -74.95575, y.max = 40.13793,    # set map boundaries with x.min, y.min, etc..      
        symbol=12,                              # select north arrow symbol
        anchor = c(x = -75, y = 39.93)) +       # set north bar location
  scalebar(x.min = -75.28031, y.min = 39.86747, # add scalebar with scalebar()
           x.max = -74.95575, y.max = 40.13793, # set map boundaries
           dist = 5, dist_unit = 'km',          # set scalebar segment length = 5km
           transform = TRUE,                    # TRUE for decimal degree coordinates
           model = "WGS84")                      # specify CR

#OR....
  
#Low access plot, using left join, blue:
ggplot() +
  geom_sf(data = test_counties_left, aes(fill = 
        MEAN_PCT_LA_COUNTY), lwd = 0) + 
        my_theme() +
        ggtitle("Percent of population with low access to food, by U.S. county, 2015, left, blue scheme") +
        theme(legend.position = "topright")

#Starter choropleth for diabetes....

#Diabetes plot, using left join, blue:
ggplot() +
  geom_sf(data = test_counties_left, aes(fill = CHR_PCT_DIABETES), lwd = 0) +
  my_theme() +
  ggtitle("Percent of population with diabetes, by U.S. county, 2015, left, blue theme")+
  theme(legend.position = "none")

#OR...

#Alternative to create an interactive diabetes map:

#Popup message
pu_message <- paste0(test_counties_left$COUNTY.y, ", ", test_counties_left$STATE.y, "<br>Diabetes prevalence 2015: ", round(test_counties_left$CHR_PCT_DIABETES, 1), "%")


#Interactive map with legend and scalebar
leaflet(test_counties_left) %>%
  addPolygons(fillColor = ~pal_fun4(CHR_PCT_DIABETES),   
  popup = pu_message) %>%             
  addTiles() %>% addLegend("bottomright", pal=pal_fun4,        values = ~CHR_PCT_DIABETES,
            title = '2015 Diabetes prevalence rates',
            opacity = 1) %>% addScaleBar() 
##
#Scatterplot, correlation

#Which, if any, of the following are worth keeping? Suggestions welcome.

library(modelsummary)
## 
## Attaching package: 'modelsummary'
## The following object is masked from 'package:tidycensus':
## 
##     get_estimates
#Scatterplot
ggplot(test_counties_left, aes(x = MEAN_PCT_LA_COUNTY, y = CHR_PCT_DIABETES)) +
    geom_point(color = "darkseagreen3") 
## Warning: Removed 216 rows containing missing values (geom_point).

#Correlation

ggplot(data = test_counties_left, aes(x = MEAN_PCT_LA_COUNTY, y = CHR_PCT_DIABETES)) +
    geom_point(color = "darkseagreen3") + ggtitle("Scatterplot: Degree of low access vs. Diabetes prevalence")
## Warning: Removed 216 rows containing missing values (geom_point).

  #Pearson
cor(test_counties_left$MEAN_PCT_LA_COUNTY, test_counties_left$CHR_PCT_DIABETES, use = "complete.obs") 
## [1] -0.2911943
  #Spearman
cor(test_counties_left$MEAN_PCT_LA_COUNTY, test_counties_left$CHR_PCT_DIABETES, use = "complete.obs", method = "spearman") #Rank-based correlation
## [1] -0.3106256
cor(test_counties_left$MEAN_PCT_LA_COUNTY, test_counties_left$CHR_PCT_DIABETES, use = "complete.obs") 
## [1] -0.2911943
#Logistic model
LA_DM_lm_fit <- lm(MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES, data = test_counties_left)
LA_DM_lm_fit
## 
## Call:
## lm(formula = MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES, data = test_counties_left)
## 
## Coefficients:
##      (Intercept)  CHR_PCT_DIABETES  
##            67.96             -2.57
summary_LA_DM_lm_fit <- summary.lm(LA_DM_lm_fit) #Summary of results
summary_LA_DM_lm_fit
## 
## Call:
## lm(formula = MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES, data = test_counties_left)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.510 -16.022  -0.798  14.647  71.366 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       67.9600     1.8608   36.52   <2e-16 ***
## CHR_PCT_DIABETES  -2.5704     0.1571  -16.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.44 on 2891 degrees of freedom
##   (216 observations deleted due to missingness)
## Multiple R-squared:  0.08479,    Adjusted R-squared:  0.08448 
## F-statistic: 267.9 on 1 and 2891 DF,  p-value: < 2.2e-16
names(summary_LA_DM_lm_fit) #We can retrieve output from summary statistics for model fit
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"  "na.action"
summary_LA_DM_lm_fit$adj.r.squared
## [1] 0.08447756
coef(LA_DM_lm_fit) #We recover the intercept and coefficient for x that are expected
##      (Intercept) CHR_PCT_DIABETES 
##        67.959951        -2.570355
confint(LA_DM_lm_fit) #Confidence intervals for fit
##                      2.5 %    97.5 %
## (Intercept)      64.311420 71.608483
## CHR_PCT_DIABETES -2.878302 -2.262408
ggplot(test_counties_left, aes(x = MEAN_PCT_LA_COUNTY, y = CHR_PCT_DIABETES)) +
    geom_point(color = "darkseagreen3") + 
    geom_smooth(method = "lm", color = "green4")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 216 rows containing non-finite values (stat_smooth).
## Warning: Removed 216 rows containing missing values (geom_point).

modelsummary(LA_DM_lm_fit, statistic = "p.value") #Nicer table of output
Model 1
(Intercept) 67.960
(0.000)
CHR_PCT_DIABETES −2.570
(0.000)
Num.Obs. 2893
R2 0.085
R2 Adj. 0.084
AIC 25950.8
BIC 25968.7
Log.Lik. −12972.415
RMSE 21.44
summary(LA_DM_lm_fit)
## 
## Call:
## lm(formula = MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES, data = test_counties_left)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.510 -16.022  -0.798  14.647  71.366 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       67.9600     1.8608   36.52   <2e-16 ***
## CHR_PCT_DIABETES  -2.5704     0.1571  -16.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.44 on 2891 degrees of freedom
##   (216 observations deleted due to missingness)
## Multiple R-squared:  0.08479,    Adjusted R-squared:  0.08448 
## F-statistic: 267.9 on 1 and 2891 DF,  p-value: < 2.2e-16
#How best to interpret these findings? As food access becomes more limited in a county, prevalence of diabetes decreases?
#Assessing other variables that may have a relationship to diabetes prevalence, beyond low access to food...

#Median household income
ggplot(test_counties_left, aes(x = ACS_MEDIAN_HH_INC, y = CHR_PCT_DIABETES)) +
  geom_point(color = "sienna4") + ggtitle("Median household income relationship to diabetes prevalence")
## Warning: Removed 216 rows containing missing values (geom_point).

summary((lm(CHR_PCT_DIABETES ~ ACS_MEDIAN_HH_INC, data = test_counties_left)))
## 
## Call:
## lm(formula = CHR_PCT_DIABETES ~ ACS_MEDIAN_HH_INC, data = test_counties_left)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3549 -1.3533 -0.0076  1.3629  7.1082 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.713e+01  1.582e-01  108.30   <2e-16 ***
## ACS_MEDIAN_HH_INC -1.180e-04  3.254e-06  -36.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.105 on 2891 degrees of freedom
##   (216 observations deleted due to missingness)
## Multiple R-squared:  0.3127, Adjusted R-squared:  0.3125 
## F-statistic:  1315 on 1 and 2891 DF,  p-value: < 2.2e-16
#Per capita income
ggplot(test_counties_left, aes(x = ACS_PER_CAPITA_INC, y = CHR_PCT_DIABETES)) +
  geom_point(color = "sienna4") + ggtitle("Per capita income relationship to diabetes prevalence")
## Warning: Removed 216 rows containing missing values (geom_point).

summary((lm(CHR_PCT_DIABETES ~ ACS_PER_CAPITA_INC, data = test_counties_left)))
## 
## Call:
## lm(formula = CHR_PCT_DIABETES ~ ACS_PER_CAPITA_INC, data = test_counties_left)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.748 -1.393  0.043  1.371  7.391 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.742e+01  1.754e-01   99.30   <2e-16 ***
## ACS_PER_CAPITA_INC -2.390e-04  6.987e-06  -34.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.143 on 2891 degrees of freedom
##   (216 observations deleted due to missingness)
## Multiple R-squared:  0.2881, Adjusted R-squared:  0.2879 
## F-statistic:  1170 on 1 and 2891 DF,  p-value: < 2.2e-16
#Obesity
ggplot(test_counties_left, aes(x = CHR_PCT_ADULT_OBESITY, y = CHR_PCT_DIABETES)) +
  geom_point(color = "sienna4") +
  ggtitle("Obesity rate relationship to diabetes prevalence") 
## Warning: Removed 216 rows containing missing values (geom_point).

summary((lm(CHR_PCT_DIABETES ~ CHR_PCT_ADULT_OBESITY, data = test_counties_left)))
## 
## Call:
## lm(formula = CHR_PCT_DIABETES ~ CHR_PCT_ADULT_OBESITY, data = test_counties_left)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0388 -1.2559 -0.1007  1.1966  7.1952 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -0.350080   0.249183  -1.405     0.16    
## CHR_PCT_ADULT_OBESITY  0.372427   0.007706  48.330   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.889 on 2891 degrees of freedom
##   (216 observations deleted due to missingness)
## Multiple R-squared:  0.4469, Adjusted R-squared:  0.4467 
## F-statistic:  2336 on 1 and 2891 DF,  p-value: < 2.2e-16
#Uninsured status
ggplot(test_counties_left, aes(x = ACS_PCT_UNINSURED, y = CHR_PCT_DIABETES)) +
  geom_point(color = "sienna4") + ggtitle("Uninsured status relationship to diabetes prevalence")
## Warning: Removed 216 rows containing missing values (geom_point).

summary((lm(CHR_PCT_DIABETES ~ ACS_PCT_UNINSURED, data = test_counties_left)))
## 
## Call:
## lm(formula = CHR_PCT_DIABETES ~ ACS_PCT_UNINSURED, data = test_counties_left)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9412 -1.6379 -0.0419  1.5832  8.8885 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       9.963767   0.124819   79.83   <2e-16 ***
## ACS_PCT_UNINSURED 0.122181   0.008819   13.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 2891 degrees of freedom
##   (216 observations deleted due to missingness)
## Multiple R-squared:  0.06225,    Adjusted R-squared:  0.06193 
## F-statistic: 191.9 on 1 and 2891 DF,  p-value: < 2.2e-16
#Combined linear models

#Including median income
lm_medincome <- lm(MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES + ACS_MEDIAN_HH_INC, data = test_counties_left)
summary(lm_medincome)
## 
## Call:
## lm(formula = MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES + ACS_MEDIAN_HH_INC, 
##     data = test_counties_left)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.409 -15.728  -0.633  14.607  70.812 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.360e+01  3.611e+00  14.843  < 2e-16 ***
## CHR_PCT_DIABETES  -2.081e+00  1.888e-01 -11.023  < 2e-16 ***
## ACS_MEDIAN_HH_INC  1.847e-04  3.985e-05   4.636  3.7e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.37 on 2890 degrees of freedom
##   (216 observations deleted due to missingness)
## Multiple R-squared:  0.09155,    Adjusted R-squared:  0.09092 
## F-statistic: 145.6 on 2 and 2890 DF,  p-value: < 2.2e-16
#Including per capita income
lm_percapincome <- lm(MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES + ACS_PER_CAPITA_INC, data = test_counties_left)
summary(lm_percapincome)
## 
## Call:
## lm(formula = MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES + ACS_PER_CAPITA_INC, 
##     data = test_counties_left)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -63.404 -15.747  -0.585  14.607  72.604 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.274e+01  3.672e+00  14.361  < 2e-16 ***
## CHR_PCT_DIABETES   -2.093e+00  1.854e-01 -11.284  < 2e-16 ***
## ACS_PER_CAPITA_INC  3.964e-04  8.256e-05   4.801 1.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.36 on 2890 degrees of freedom
##   (216 observations deleted due to missingness)
## Multiple R-squared:  0.09203,    Adjusted R-squared:  0.09141 
## F-statistic: 146.5 on 2 and 2890 DF,  p-value: < 2.2e-16
#Including obesity
lm_obesity <- lm(MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES + CHR_PCT_ADULT_OBESITY, data = test_counties_left)
summary(lm_obesity)
## 
## Call:
## lm(formula = MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES + CHR_PCT_ADULT_OBESITY, 
##     data = test_counties_left)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.785 -15.979  -0.739  14.674  71.461 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            65.0454     2.8299  22.985   <2e-16 ***
## CHR_PCT_DIABETES       -2.7633     0.2111 -13.087   <2e-16 ***
## CHR_PCT_ADULT_OBESITY   0.1608     0.1176   1.367    0.172    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.44 on 2890 degrees of freedom
##   (216 observations deleted due to missingness)
## Multiple R-squared:  0.08539,    Adjusted R-squared:  0.08475 
## F-statistic: 134.9 on 2 and 2890 DF,  p-value: < 2.2e-16
#Including uninsured status
lm_uninsured <- lm(MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES + ACS_PCT_UNINSURED, data = test_counties_left)
summary(lm_uninsured)
## 
## Call:
## lm(formula = MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES + ACS_PCT_UNINSURED, 
##     data = test_counties_left)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51.72 -15.84  -0.62  14.49  72.13 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       66.31668    1.94594  34.080  < 2e-16 ***
## CHR_PCT_DIABETES  -2.68548    0.16198 -16.579  < 2e-16 ***
## ACS_PCT_UNINSURED  0.22596    0.07932   2.849  0.00442 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.42 on 2890 degrees of freedom
##   (216 observations deleted due to missingness)
## Multiple R-squared:  0.08736,    Adjusted R-squared:  0.08673 
## F-statistic: 138.3 on 2 and 2890 DF,  p-value: < 2.2e-16
#How about a model that accounts for all of the above variables?
#Trying...
lm_combinedvars<- lm(MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES + ACS_MEDIAN_HH_INC + ACS_PER_CAPITA_INC + CHR_PCT_ADULT_OBESITY + ACS_PCT_UNINSURED, data = test_counties_left)
modelsummary(lm_combinedvars, statistic = "p.value")
Model 1
(Intercept) 26.533
(0.000)
CHR_PCT_DIABETES −2.526
(0.000)
ACS_MEDIAN_HH_INC 0.000
(0.425)
ACS_PER_CAPITA_INC 0.001
(0.000)
CHR_PCT_ADULT_OBESITY 0.481
(0.000)
ACS_PCT_UNINSURED 0.565
(0.000)
Num.Obs. 2893
R2 0.107
R2 Adj. 0.105
AIC 25889.2
BIC 25931.0
Log.Lik. −12937.617
RMSE 21.18
lm_combinedvars
## 
## Call:
## lm(formula = MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES + ACS_MEDIAN_HH_INC + 
##     ACS_PER_CAPITA_INC + CHR_PCT_ADULT_OBESITY + ACS_PCT_UNINSURED, 
##     data = test_counties_left)
## 
## Coefficients:
##           (Intercept)       CHR_PCT_DIABETES      ACS_MEDIAN_HH_INC  
##             2.653e+01             -2.526e+00              5.599e-05  
##    ACS_PER_CAPITA_INC  CHR_PCT_ADULT_OBESITY      ACS_PCT_UNINSURED  
##             6.314e-04              4.812e-01              5.646e-01
summary(lm_combinedvars)
## 
## Call:
## lm(formula = MEAN_PCT_LA_COUNTY ~ CHR_PCT_DIABETES + ACS_MEDIAN_HH_INC + 
##     ACS_PER_CAPITA_INC + CHR_PCT_ADULT_OBESITY + ACS_PCT_UNINSURED, 
##     data = test_counties_left)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -66.147 -15.437  -0.443  14.237  72.625 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.653e+01  5.520e+00   4.807 1.61e-06 ***
## CHR_PCT_DIABETES      -2.526e+00  2.300e-01 -10.982  < 2e-16 ***
## ACS_MEDIAN_HH_INC      5.599e-05  7.014e-05   0.798 0.424811    
## ACS_PER_CAPITA_INC     6.314e-04  1.555e-04   4.062 5.00e-05 ***
## CHR_PCT_ADULT_OBESITY  4.812e-01  1.249e-01   3.853 0.000119 ***
## ACS_PCT_UNINSURED      5.646e-01  8.965e-02   6.297 3.49e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.2 on 2887 degrees of freedom
##   (216 observations deleted due to missingness)
## Multiple R-squared:  0.1065, Adjusted R-squared:  0.105 
## F-statistic: 68.86 on 5 and 2887 DF,  p-value: < 2.2e-16
#How to properly interpret these findings? #How to state conclusions appropriately?

#The following are code-writing reference materials (from practicums, homeworks), not part of final project… ############################################################## #From reference, taken from practicum 5 - does this material help at all to interpret the impact of other variables such as median income, obesity etc.? # We might try a combined model to see whether the change in fev1 associated with fvc changes by sex, or whether there is an interaction between fvc and sex. Recall from lecture that the + sex term allows for a potential change in intercept in the line that models fev1 ~ fvc for a given sex, while the + fvc:sex term allows for a potential change in slope that models the fev1 ~ fvc relationship for a given sex.

lm1 <- lm(fev1 ~ fvc + sex, data = nhanes) lm2 <- lm(fev1 ~ fvc + sex + fvc:sex, data = nhanes) modelsummary(list(lm1, lm2), coef_omit = “Intercept”, statistic = c(“p = {p.value}”), estimate = “{estimate} [{conf.low}, {conf.high}]”) ################################################################

##############################################################
#Extra content from homework #2 for reference, possible use:

#* How many subjects had wheezing or whistling in their chest in the past year?
{r, eval = TRUE}
x<-nhanesdata$RDQ070
count1<- length(which(x==1))
count1

#* How many adults are part of the study (with adults defined as age of 18 years or greater)?
{r, eval = TRUE}
y<-nhanesdata$RIDAGEYR
countadult<- length(which(y>= 18))
countadult

#From homework #3
#    + How many flights in the dataset had destination of Miami International Airport (MIA)? Which airport had the most departing flights to MIA?
{r, eval=TRUE}
library(nycflights13)
table(flights$dest == "MIA")
    

table(flights$dest == 'MIA', flights$origin)

+ Of the flights that departed from JFK and LGA with time spent in the air greater than two hours, how many had tail numbers (tailnum) containing JB? Which airlines (carrier) flew these flights, and how many of these flights corresponded to each carrier?

{r, eval = TRUE}
library(dplyr) listone<- dplyr::select(flights, -year, - month, -day, -dep_time, -sched_dep_time, -dep_delay, -arr_time, -sched_arr_time, -arr_delay, -flight, -dest, -distance, -hour, -minute) listtwo<- dplyr::filter(listone, origin == ‘JFK’ | origin == ‘LGA’, air_time >120) %>% group_by(tailnum, carrier) listthree<- dplyr::filter(listtwo, (grepl(‘JB’, tailnum))) %>% group_by(carrier) listthree %>% count(carrier)

#    + List the top 10 destinations based on having the most flights to each, along with the number of flights to each destination. Next, list the destination with the most flights per month, along with the corresponding number of flights for each month.
{r, eval=TRUE}
flights %>% count(dest)
destcount<- flights %>% count(dest)
head(destcount)
dplyr::arrange(destcount, desc(n))

flfreq<- flights %>% dplyr::group_by(month, dest) %>% dplyr::summarise(destfreq= n())
flfreq
arrange(flfreq, desc(destfreq))
top_1<- flfreq %>% dplyr::slice_max(destfreq, n = 1)
top_1
+ Was the mean arrival delay (arr_delay) per carrier related to the total number of flights per carrier? The answer should show a plot and use one sentence to address this question qualitatively. That is, there is no need to create a model to formally evaluate a relationship.

{r, eval = TRUE} library(ggplot2) mean_arr_delay<- flights %>% dplyr::group_by(carrier) %>% summarise(mean_delay = mean(arr_delay, na.rm = TRUE)) mean_arr_delay

tot_fl_carrier<- flights %>% dplyr::select(carrier, flight) %>% group_by(carrier) %>% summarise(totflights = n()) tot_fl_carrier

fulljoin <- dplyr::full_join(mean_arr_delay, tot_fl_carrier) fulljoin

p <- ggplot(fulljoin, aes(x = totflights, y = mean_delay)) + geom_point() + geom_smooth(method= “lm”, color = “red”) p ########################################################### END OF DRAFT ###